6 Introduction to CKIP Parts of Speech System


This part is an extension of the CKIP (Chinese Knowledge Information Processing)
group’s POS Analysis of Contemporary Chinese (“CKIP_POS”), originally
published in 1986. Following three years of extensive study and analysis of
over 40,000 lemmas in the Mandarin Daily Dictionary (Guoyu Ribao), a revised
version of CKIP_POS was published in 1989. The next step was the development
and introduction by CKIP of Information-based Case Grammar for natural
language parsing (ICG; Chen and Huang 1990). Based on ICG, we built a Chinese
electronic lexicon of approximately 80,000 lemmas, each annotated with its
lexical class, phonetic annotation, frequency, semantic classification, etc.

This publication addresses the need to provide users and the public with 
a full account of our POS classification framework and its criteria, a keenly 
felt need since the release of the full electronic lexicon to the public in 1992. 
However, unlike CKIP_POS or its revision, the book features not only traditional 
explanations, definitions and examples for each headword, but also additional 
criteria for classification in each lexical entry, especially those for borderline 
cases. Moreover, the example sentences in this book are extracted primarily from 
the Sinica Corpus (Chen et al. 1996) and represent authentic language use. 

Prior to a detailed discussion of POS classifications, Section 6.1 gives an
introductory account of the lexical entries as well as the tagset in our lexicon;
Section 6.2 introduces the syntactic features.


6.1 Word and its POS Tag in the CKIP Lexicon 


The entries in the CKIP lexicon include not only words,¹ but also sub-lexical
units smaller than words, as well as phrases and idioms. However, compounds
that are highly productive or that can be derived by grammatical rules are
not included in the lexicon. For instance, Determiner-Classifier compounds
and reduplicated words can be treated compositionally with morphological rules;
hence they are not included. POS information is assigned to each lexical entry.
For instance, tag A is attached to non-predicative adjectives (see Chapter 8); tag I
is attached to interjections (see Chapter 14). General tagging principles will be
discussed after Chapter 6. In the following sections, only the criteria for specific
tag assignments are introduced.
6.1.1 Annotation Guidelines for Bound Morphemes


In addition to free lexical entries, there are numerous bound components
(morphemes) that have to co-occur with other independent components. Currently,
we mark these components with a b feature, e.g. 木 ‘wood’ in 樹木 ‘tree’
and 木材 ‘wood’; 述 ‘to state’ in 描述 ‘describe’ and 述職 ‘duty report’;
式 ‘style’ in 程式 ‘program’ and 公式 ‘formulae’. Some disyllabic bound
components can also be found, such as 無度 ‘endless’ in 揮霍無度 ‘endless
squander’ and 需索無度 ‘endless demand’, or 不力 ‘without conviction’ in
工作不力 ‘work without conviction’ and 執行不力 ‘execute without conviction’.
It is worth noting that the same word form in a lexical entry can function both
as a word with a POS tag and as a bound morpheme tagged with a feature. For
example:


聲 ‘sound’


1  (Nfi) Classifier that modifies action verbs. For example, 叫一聲 ‘called
once’, 喊一聲 ‘yelled once’.

2  (b) For example, 鋼琴聲 ‘sound of piano’, 鐘聲 ‘sound of bell’, 鼓聲 ‘sound
of drum’, 撞擊聲 ‘sound of impact’.


聲 is tagged as Classifier (Nfi) in 1, and as a productive suffix in 2. Without
the b tag, the automatic POS tagger will yield incorrect results.
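
The dual status of such a form can be pictured as one lexical entry holding two readings. The following is only a minimal sketch in Python, not the CKIP lexicon format: the `Reading` class, the `LEXICON` dictionary and both helper functions are hypothetical names introduced for illustration, and the form glossed ‘sound’ is written here as 聲 on the assumption that this is the intended character.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Reading:
    """One reading of a word form: a POS tag, or the bound-morpheme feature b."""
    tag: str   # e.g. "Nfi" (classifier for action verbs) or "b" (bound)
    note: str  # informal description, for human readers only

    @property
    def is_bound(self) -> bool:
        return self.tag == "b"


# Hypothetical miniature lexicon holding only the 'sound' entry discussed above.
LEXICON: Dict[str, List[Reading]] = {
    "聲": [
        Reading("Nfi", "classifier modifying action verbs ('called once')"),
        Reading("b", "productive suffix ('sound of ...')"),
    ],
}


def free_tags(form: str) -> List[str]:
    """POS tags under which the form can stand as an independent word."""
    return [r.tag for r in LEXICON.get(form, []) if not r.is_bound]


def has_bound_reading(form: str) -> bool:
    """True if the form is also listed as a bound morpheme."""
    return any(r.is_bound for r in LEXICON.get(form, []))


print(free_tags("聲"))
print(has_bound_reading("聲"))
```

Keeping the b reading separate from the POS readings lets a tagger consider the suffix analysis before committing to the classifier tag, which is the failure case the text warns about.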


6.1.2 Annotation Guidelines for Sentences 


In addition to words, there are 12 ‘sentences’ in the CKIP Lexicon. Most of
these sentences are fixed expressions. Syntactically, they are well-structured with
Subject, Verb and Object; semantically, they are independent units requiring
no argument participation (e.g. 家醜不可外揚 ‘Don’t air your dirty laundry
in public’, 家書抵萬金 ‘A letter from home is as good as gold’, 家家有本難念的經
‘Each family has its own difficulties’). Currently, the tag
S is attached to them, indicating their sentential attribute and distinguishing
them from other POS tags. However, not every phrase or idiom is tagged
with S. If a phrase or idiom can be used as a verb, then priority
is given to the stative intransitive verb tag (VH11), as with 米珠薪桂 ‘prices
are skyrocketing’ (literally ‘rice priced like pearls, and firewood priced like
cinnamon’) and 古木參天 ‘tall woods reaching the sky’, as seen in the following
sentences.


1  台北物價米珠薪桂 ‘Taipei’s product prices are skyrocketing.’
2  三貂角一帶古木參天 ‘The area around San Tiago is verdant with tall woods
reaching the sky.’
6.1.3 Annotation Guidelines for Determiner-Measure Compounds


Since Determiner and Classifier/Measure can productively form a DM compound 
in a compositional manner, which renders exhaustive enumeration impossible, we 
have proposed a set of morphological rules to deal with DM compounds. As a 
result, DM compounds (e.g. — AX, #41) are not included in the lexicon, except 
when they also carry other semantic functions, as shown in the following: 


ia tk - Dh ysl), Nfe (E st zk) ‘this way, such’ 3 
Æ - Nea (Hh 4 ail), Nfd GE He x\) ‘SanChung (a town in New Taipei City)’ 
= - Dbab (EAH Billa), Nfzz GE st zk) ‘At any rate’. 


6.1.4 Annotation Guidelines for Reduplicated Words 


Reduplicated words (e.g. Hi bt 4 4% (from R4% ‘happy’) , #1] from FT ‘hit’), 
like DM compounds, can be derived using morphological rules; hence, they are 
not included in the lexicon. Two exceptions are as follows: 


• Words with reduplicated forms but without corresponding morphological rules
of derivation, e.g. nouns with reduplicated forms such as 風風雨雨 ‘winds and rains;
turbulent times’⁴ and 事事物物 ‘things and objects, all things big and small’;
adverbs with reduplicated forms such as 常常 ‘often’; and words in reduplicated form
without corresponding un-reduplicated roots, such as 熙熙攘攘 ‘busy and bustling’,
漸漸 ‘gradually’ and 家家戶戶 ‘each and every family’.⁵

• Words with reduplicated forms whose syntactic/semantic behavior cannot
be predicted by morphological rules. These words are listed separately in
the lexicon and assigned multiple POS tags, e.g. 乖乖 ‘guai guai’:


1  Naa (Material nouns, FOOD, a brand of rice snacks)
2  VH11 (Stative intransitive verbs, reduplicated form of 乖)


6.1.5 Annotation Guidelines for Verb-Complement Compounds


The definition of Verb-Complement (VC) Compounds in Chinese remains a
topic of ongoing research. Basically, a VC compound consists of at least two
predicative morphemes. The verb component usually refers to an action, while the
complement component refers to a result or direction (e.g. 打開 ‘to open’, 跑來
‘to run over here’). For all its straightforward structure, a VC’s syntactic
and semantic properties are complicated. Syntactically, some VC compounds
are composed of two intransitive verbs, but the two intransitives combine into
a transitive when compounded (e.g. 走 ‘walk’ + 破 ‘break’ → 走破 ‘wear
to broken by walking’, as in 他走破一雙鞋 ‘He walked through a pair of shoes’).
Other VC compounds comprise one transitive and one intransitive, but the two
verbs combine into an intransitive when compounded, e.g. 灌 ‘pour’ + 醉
‘drunk’ → 灌醉 ‘to cause (someone) to be drunk by forcing him/her to drink’, as in
張三灌醉了李四 ‘Zhangsan made Lisi drunk by making him drink excessively’.
Although the meaning of most VC compounds can usually be inferred from their
component parts, as in 吃飯 ‘eat+rice, to eat a meal’ and 喝醉 ‘drink+drunk, to
get drunk’, some cannot, as in (116) and (117): the meaning of 看 ‘look’ + 來
‘come’ in (116) is not transparently derived, nor is that of 容 ‘contain’ + 下 ‘down’
in (117).


(116) 他 看起來 很好
tā kànqǐlái hěnhǎo
‘S/He looks great.’ 


(117) 車子 可 容下 二人
chēzi kě róngxià èrrén 
‘The car can accommodate two people.’ 


Given the productivity and the non-compositionality of the VC compounds, 
their sub-classes cannot be exhaustively listed in the lexicon. Currently, we do 
not assign them to any specific subclasses. They are assigned with the VR feature 
instead. We also allow a VC compound to carry the semantic features of its two 
components: verbs and complements, from which the syntactic and semantic 
features of the VC can be inferred. For instance, we found that verbs carrying 
movement features can be combined with complements carrying direction 
features. This is cognitively sound in that the motion implies the moving direction. 
Therefore, we have a morphological rule V[+movement] + R[+direction] → VR,
which correctly predicts that action verbs such as [H] [+movement] can 
combine with directional complements [E] [+upward], [ F] [+downward], 
[if | [inward], [H] [+outward] to form grammatical VR compounds such as 
(ra kJ DEF] TRE] [i8 ], whereby the compositional meaning can be 
inferred as well. We plan to study how to predict a VC’s syntactic behavior based 
on these features in the future. 
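
The rule V[+movement] + R[+direction] → VR can be read as a feature check over the two components. The sketch below is an illustration under stated assumptions, not the CKIP implementation: the feature tables, the function name `combine_vr`, and the choice of 走 ‘walk’, 跑 ‘run’ and 吃 ‘eat’ as sample verbs are all hypothetical, as is the decision to write the four directional complements as 上, 下, 進 and 出.

```python
from typing import Dict, Optional, Set

# Hypothetical feature assignments, for illustration only.
VERB_FEATURES: Dict[str, Set[str]] = {
    "走": {"+movement"},   # 'walk'
    "跑": {"+movement"},   # 'run'
    "吃": set(),           # 'eat': no movement feature
}

COMPLEMENT_FEATURES: Dict[str, Set[str]] = {
    "上": {"+direction", "+upward"},
    "下": {"+direction", "+downward"},
    "進": {"+direction", "+inward"},
    "出": {"+direction", "+outward"},
}


def combine_vr(verb: str, complement: str) -> Optional[dict]:
    """Apply V[+movement] + R[+direction] -> VR.

    Returns the compound with the VR feature and the features inherited from
    both components, or None when the rule does not license the combination.
    """
    verb_feats = VERB_FEATURES.get(verb, set())
    comp_feats = COMPLEMENT_FEATURES.get(complement, set())
    if "+movement" in verb_feats and "+direction" in comp_feats:
        return {
            "form": verb + complement,
            "feature": "VR",
            "inherited": sorted(verb_feats | comp_feats),
        }
    return None


for complement in ("上", "下", "進", "出"):
    print(combine_vr("走", complement))
print(combine_vr("吃", "上"))
```

Because the compound carries the union of its components’ features, nothing about the compound itself needs to be stored in the lexicon; only the verb and complement entries do.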


6.2 POS Annotation 


There are eight major POS classes in the CKIP Lexicon—verbs, non-predicative 
adjectives, nouns, adverbs, prepositions, connectives, particles, and interjections. 
Other than non-predicative adjectives and interjections, all the POS classes are 
further divided into sub-classes based on their semantic and syntactic behaviors 
(see Appendix). For instance, nouns are classified into material nouns, individual 
nouns, individual abstract nouns, abstract nouns, and collective nouns. Verbs 
are first classified into action and stative verbs, then further into subclasses 
such as intransitive verbs, quasi-transitive verbs, di-transitive verbs, sentential 
object verbs, verb-phrase object verbs, etc. More details will be addressed in the 
following sections. 
Due to the lack of morphological markers, the difficulty we have often
encountered when analyzing the data is that the same lemma can play different 
syntactic roles. For example, a verb can often serve as the main verb but also as a
noun modifier, as in 評估報告 ‘to evaluate a report’ or ‘an evaluation report’ and
HTH 2 ‘beautiful girl’; many can occur in the position of nouns (so-called
nominalized verbs), as in (118). Similarly, nouns can function as modifiers, such
as 蘋果臉 ‘apple-face’, and can also act as predicates, as in (119).


(118) 他 的 調查 顯示出 不同 的 結果
tā de diàochá xiǎnshìchū bùtóng de jiéguǒ
he DE investigate reveal out different DE result 
‘His investigation showed a different result.’ 


(119) 她 很 寶貝 她 的 頭髮
tā hěn bǎobèi tā de tóufa
she very treasure/baby her DE hair 
‘She treasures/pampers her hair.’ 


The distinction is also made between a word with polyfunctionality and a word 
where multiple syntactic categories are assigned. These two are treated differently 
in our framework based on observations of their actual use in large corpora. 
Syntactic features are attached to the former, which will be illustrated in detail 
in section 6.2.1; section 6.2.2 will address the conditions under which a word will 
be analyzed in terms of multiple syntactic categories. 


6.2.1 Polyfunctionality of Words 


Some of the syntactic categories in Chinese serve multiple functions in various
contexts, but are consistent in syntactic behavior; therefore, syntactic
features are assigned instead of different POS tags, in the hope of facilitating
the parsing task in natural language processing. Four constructions are discussed
as follows:

Firstly, most nominals and verbs in Chinese can serve as modifiers; however,
we do not assign them multiple POS tags, but specify the syntactic
information within the representation model of nominals in Information-based
Case Grammar (ICG; Chen and Huang 1990), as shown in (X):

Secondly, a large number of simple verbs and verbs followed by DE (的) or DI
(地), stative verbs in particular, can serve as the manner adverb of the main verb
in a sentence. For instance, [ JJ] ‘dedicate’ is the main verb in [MBR EJ |
‘S/He is dedicated’, while in | (th7 AHN VE] ‘S/He works dedicatedly’ it
serves as the modifier of the main verb [工作] ‘work’; another example is | Je
Ph] ‘emotionally’ in | {theses FIRR] ‘S/He cried emotionally’.

In the above cases, we do not assign different POS tags to the word, but
annotate it with features such as +way or +de, etc. (Wei 1991). One can use
these features in the parser when deciding between ‘main verb’ and ‘modifier’,
or during the analysis of non-predicative adjectives (A) or determiner-measure
(DM) compounds, as shown in (X) - (X).


(120) 他們 非法 入境 (A)+de
tāmen fēifǎ rùjìng
‘They entered the country illegally.’ 


(121) 他 一個一個 數 (DM)+way
tā yīgèyīgè shǔ
‘He counts one by one.’ 


Thirdly, time nouns in Chinese often serve as temporal modifiers (Chang
1988), as in (X). In many English dictionaries, words like ‘tomorrow’ have
two syntactic classes: noun and adverb. In our framework, only the nominal tag is
assigned to 明天 ‘tomorrow’. Although nouns and adverbs differ considerably
in position and syntactic function, time nouns in Chinese often combine
with temporal noun phrases into a larger temporal unit that modifies the whole
sentence. Information about nouns carrying temporal features is passed to the
parser so that it can identify the modifying role without multiple POS tags
being assigned.


(122) 他 明天 不 來
tā míngtiān bù lái
‘He won’t come tomorrow.’ 


Fourthly, verbs are often nominalized in Chinese. Chinese verbs frequently serve
as nominals, sharing their syntactic properties when modified by DM compounds
(Tang 1989) (e.g. (b EIR SERN HS — JAWI ‘S/He argues that the two research
(projects) should be completed’). Though the verb is nominalized in this case
(Yeh et al. 1992), we annotate it with syntactic features rather than with a
different POS tag, both to reduce complexity during automatic POS assignment and
to gain a deeper grasp of the interactions between verbs and nominals. It is
important to note that in nominalization, although the syntactic behavior
changes, the argument structure is preserved. For instance, 認同 ‘to identify
with’ has two arguments (THEME and GOAL); both are retained under
nominalization and differ only in their realization forms, as illustrated by a and b
(Yeh et al. 1992).⁶


b.Bd A HEGRE RACER 


Figure 6.1 Shared argument mapping of deverbal nouns 
6.2.2 Multiple Syntactic Classification of Words


The following guidelines are proposed as the conditions under which multiple 
syntactic classes will be assigned to words. 


1  Homonyms or homographs

Multiple assignments apply to homonyms, that is, words that coincidentally
share the same form while having different senses.

For instance, the word form [重] has distinct senses that fit into different
syntactic classes (‘again’: adverb; ‘heavy’: stative verb); other examples
include [會] (‘meeting’: noun; ‘will’: verb), [只有] (‘only if’: conjunction;
‘only’: adverb), etc. Please note again that polysemous words will not be
assigned multiple POS tags.

2  Common nouns that lose their referring function and acquire verbal
characteristics

For example, [油] (‘oil’ vs. ‘greasy’, (123)), [火] (‘fire’ vs. ‘mad’, (124))
and [寶貝] (‘baby’ vs. ‘pamper’, (125)) are additionally assigned stative
intransitive or stative transitive tags, as the following examples show.


(123) 這 種 菜 很 油
zhè zhǒng cài hěn yóu
‘This type of dish is quite greasy.’

(124) 王 老師 很 火
wáng lǎoshī hěn huǒ
‘Teacher Wang was very angry.’

(125) 陳 小姐 很 寶貝 她 的 頭髮
chén xiǎojiě hěn bǎobèi tā de tóufa
‘Ms. Chen pampers her hair.’


3  Word forms with clear distinctions in both syntactic function and semantic
content

For example, [結果] in (126a) and (126b) is tagged as a sentential
adverb ‘in consequence’ and a common noun ‘result’, respectively; [不過]
in (127a) and (127b) has different meanings, functioning as the adverb
‘just’ and the conjunction ‘although’. Hence, these words are assigned
multiple POS tags, owing mainly to their distinct behaviors.


(126) (a) 結果 他 什麼 也 不 說
jiéguǒ tā shénme yě bù shuō
‘In consequence, he said nothing.’
(b) 他 知道 結果 了
tā zhīdào jiéguǒ le
‘He knew the result.’
(127) (a) 他 不過 吃 你 一口 蘋果
tā búguò chī nǐ yīkǒu píngguǒ
‘He just had a bite of your apple.’
(b) 不過, 他 還沒 滿 二十歲
búguò, tā háiméi mǎn èrshísuì
‘Although, he is not yet 20.’


Following this guideline, a word form is assigned up to four POS tags
in our lexicon. For instance, the word form [點] is assigned four tags:
abstract noun (一點 ‘one dot’); action verb with a single object (點菜 ‘to
order dishes’); proximate classifier (一點意見 ‘some opinion’); and temporal
classifier (下午三點 ‘three pm’), each with distinctive syntactic behavior.
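
Multiple classification amounts to a one-to-many mapping from word form to syntactic class. The sketch below shows one way such a mapping might be indexed; it is an assumption-laden illustration in which the `ENTRIES` list, `build_index` and `candidate_classes` are hypothetical names, the classes are given by their descriptive labels from the text rather than by CKIP tag codes, and the four-way form is written as 點.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (word form, syntactic class as described in the text, gloss)
ENTRIES: List[Tuple[str, str, str]] = [
    ("點", "abstract noun", "dot"),
    ("點", "action verb with a single object", "to order (dishes)"),
    ("點", "proximate classifier", "some"),
    ("點", "temporal classifier", "o'clock"),
    ("結果", "sentential adverb", "in consequence"),
    ("結果", "common noun", "result"),
]


def build_index(entries: List[Tuple[str, str, str]]) -> Dict[str, List[str]]:
    """Group syntactic classes by word form, preserving listing order."""
    index: Dict[str, List[str]] = defaultdict(list)
    for form, syntactic_class, _gloss in entries:
        index[form].append(syntactic_class)
    return dict(index)


def candidate_classes(index: Dict[str, List[str]], form: str) -> List[str]:
    """All classes a tagger must choose between for this form."""
    return index.get(form, [])


INDEX = build_index(ENTRIES)
print(len(candidate_classes(INDEX, "點")))
print(candidate_classes(INDEX, "結果"))
```

A tagger consulting such an index would still need context to choose among the candidates; the index only enumerates them.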


Notes 


1 Regarding the definition of Chinese wordhood, please refer to Tang (1989: 9). 

2 For ease of reference, a concise description of a Part-of-Speech tag set can be found in 
Appendix I. 

3 For DM compounds, only the heads (i.e. classifiers) are POS-tagged. Nfc, Nfd and Nfzz 
thus stand for different types of classifiers. Please refer to section 9.1.6. 

4 Unlike verbs and adjectives, there are no regular noun reduplications in Chinese and 
reduplicated forms are lexically determined. 

5 Note that there are no corresponding free word forms (i.e. [熙攘], [漸], [家戶]) in
the Chinese lexicon.

6 Currently, we use the [+argument] feature to mark the verbs that take the arguments after 
being nominalized (Yeh et al. 1992). This practice is under review. This feature has been 
discarded in favor of the [+NV] feature. 


